Implement the new tuning API for `DeviceRleDispatch` by bernhardmgruber · Pull Request #7669 · NVIDIA/cccl

bernhardmgruber · 2026-02-13T15:06:21Z

Depends on:

- Centralize delay_constructor policy helpers #7668
- Split the RLE tuning header #7666

- No SASS diff for `cub.bench.run_length_encode.non_trivial_runs.base` on SM 75;80;86;90;100
- Refactoring

Fixes: #7532

copy-pr-bot · 2026-02-13T15:06:25Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

bernhardmgruber · 2026-02-13T17:07:55Z

There are substantial SASS diffs, for example in

Function : void cub::_V_300400_SM_750_800_860_900_1000::detail::rle::DeviceRleSweepKernel<cub::_V_300400_SM_750_800_860_900_1000::detail::rle::non_trivial_runs::policy_selector_from_types<long, cuda::std::__4::complex<float>>, cuda::std::__4::complex<float> const*, long*, long*, long*, cub::_V_300400_SM_750_800_860_900_1000::ReduceByKeyScanTileState<long, int, true>, cuda::std::__4::equal_to<void>, int, long, cub::_V_300400_SM_750_800_860_900_1000::detail::rle::streaming_context<cuda::std::__4::complex<float> const*, long, long>>(cuda::std::__4::complex<float> const*, long*, long*, long*, cub::_V_300400_SM_750_800_860_900_1000::ReduceByKeyScanTileState<long, int, true>, cuda::std::__4::equal_to<void>, int, int, cub::_V_300400_SM_750_800_860_900_1000::detail::rle::streaming_context<cuda::std::__4::complex<float> const*, long, long>)

on SM75.

bernhardmgruber · 2026-02-22T13:36:20Z

> There are substantial SASS diffs, for example in [...] on SM75.

With the change from #7733, I asked Cursor:

> The last two commits lead to SASS differences on the benchmark cub.bench.run_length_encode.non_trivial_runs.base for SM75. This was likely caused by the rewrite of the tuning information and dispatch layer of DeviceRleDispatch. Please find out why the SASS changed and fix the new tuning and dispatch code so the SASS change is gone. You are not allowed to revert to the old code.

and it indeed found the root cause. I am impressed.

bernhardmgruber · 2026-02-22T17:35:03Z

cub/cub/device/dispatch/tuning/tuning_rle_non_trivial_runs.cuh

+      // TODO(bgruber): I think we want `LengthT` instead of `int`
+      return make_default_policy(BLOCK_LOAD_DIRECT, sizeof(int), LOAD_LDG);


Retaining comment on old code
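
For context on the TODO above, here is a minimal sketch of why the choice between `sizeof(int)` and `sizeof(LengthT)` can matter. It assumes a hypothetical helper, `scaled_items_per_thread`, that scales a nominal items-per-thread count tuned for 4-byte items by the item size. This is not CUB's `make_default_policy`, whose internals are not shown in this thread, and the nominal value 15 is made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not part of CUB):
// scales an items-per-thread count tuned for 4-byte items by the item size,
// clamped to the range [1, nominal_4b_items].
constexpr int scaled_items_per_thread(int nominal_4b_items, std::size_t item_size)
{
  const int scaled =
    static_cast<int>(static_cast<std::size_t>(nominal_4b_items) * 4 / item_size);
  return scaled < 1 ? 1 : (scaled > nominal_4b_items ? nominal_4b_items : scaled);
}

// Assumed nominal value, chosen only for this example.
constexpr int nominal_items = 15;

// Size currently passed in the snippet above.
constexpr int items_with_int = scaled_items_per_thread(nominal_items, sizeof(int));

// Size of a 64-bit length type, standing in for `LengthT`.
constexpr int items_with_length_t =
  scaled_items_per_thread(nominal_items, sizeof(std::int64_t));
```

In this sketch, a 64-bit length type (the kernel name quoted earlier uses `long`) yields roughly half the item count that `sizeof(int)` does on platforms where `int` is 4 bytes. If the real helper behaves similarly, switching to `LengthT` could change the generated code, which may be why the comment was retained rather than acted on in this PR.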

Fixes: NVIDIA#7532

github-actions · 2026-02-23T20:56:12Z

🥳 CI Workflow Results

🟩 Finished in 21h 20m: Pass: 100%/249 | Total: 9d 05h | Max: 3h 57m | Hits: 71%/153868

See results here.

github-project-automation bot added this to CCCL Feb 13, 2026

github-project-automation bot moved this to Todo in CCCL Feb 13, 2026

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Feb 13, 2026

bernhardmgruber mentioned this pull request Feb 13, 2026

Implement the new tuning API for Dispatch[Streaming]ReduceByKey #7667

Merged

6 tasks

bernhardmgruber force-pushed the tuning_rle_non_trivial branch from 9b20ea4 to 01f7ef3 Compare February 22, 2026 12:48

bernhardmgruber marked this pull request as ready for review February 22, 2026 12:55

bernhardmgruber requested review from a team as code owners February 22, 2026 12:55

bernhardmgruber requested review from NaderAlAwar and shwina February 22, 2026 12:55

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Feb 22, 2026


bernhardmgruber commented Feb 22, 2026


bernhardmgruber added 7 commits February 23, 2026 00:33

Implement the new tuning API for DeviceRleDispatch

45a669b

Fixes: NVIDIA#7532

Fix input type

8c3d912

Add reduce_by_key delay constructor

0ce0058

no SASS diff

2ae7a68

Refactoring and fixes

5f25041

Try to work around MSVC

8a5bb4d

MSVC

490d98b

bernhardmgruber force-pushed the tuning_rle_non_trivial branch from 5d53482 to 490d98b Compare February 22, 2026 23:33


NaderAlAwar approved these changes Feb 23, 2026


bernhardmgruber enabled auto-merge (squash) February 23, 2026 20:26

bernhardmgruber merged commit c3161ff into NVIDIA:main Feb 23, 2026
527 of 531 checks passed

bernhardmgruber deleted the tuning_rle_non_trivial branch February 23, 2026 21:04

bernhardmgruber merged 7 commits into NVIDIA:main from bernhardmgruber:tuning_rle_non_trivial


2 participants
